FINAL REPORT: FUSION OF MULTIPLE SENSING MODALITIES

FOR  MACHINE  VISION 

SUMMARY 

The Computer and Vision Research Center at The University of Texas at Austin has
undertaken a broad program of research in machine vision to develop an approach based upon
synergistically combining diverse sensing modalities. The research projects funded under Contract
DAAL-03-91-G-0050 fall into four general categories: (1) Outdoor Scene Interpretation via the
Fusion of Multiple Imaging Modalities; (2) Motion Computation and Object Recognition Using
Range Images; (3) Structure and Identity Based on Color and Shape Information; and
(4) Autonomous Navigation.

Some  of  the  highlights  of  our  accomplishments  include  the  development  of  the  AIMS 
(automatic  interpretation  using  multiple  sensors)  knowledge-based  system  to  interpret  registered 
laser  radar  and  thermal  images  for  the  detection  and  recognition  of  man-made  objects  in  outdoor 
rural  scenes,  including  a  new  algorithm  for  integration  of  region  and  edge  information  without 
the  intervention  of  high-level  knowledge,  and  the  development  of  a  new  approach  for  the  detec¬ 
tion  of  large  man-made  objects  using  perceptual  organization  techniques.  We  have  developed  a 
number  of  new  algorithms  for  object  recognition  and  motion  estimation,  including  improved  al¬ 
gorithms  for  using  three-dimensional  (range)  images  to  compute  structure  and  motion;  a  CAD- 
based  object  recognition  system  which  uses  a  three-dimensional  CAD  model  of  an  object  to  lo¬ 
cate  the  object  in  a  cluttered  scene;  a  decision-theoretical  algorithm  to  estimate  3D  structures 
from  extended  sequences  of  2D  images  taken  by  a  moving  camera;  and  an  algorithm  for  match¬ 
ing  line  segments  based  on  perceptual  grouping  relaxation  labeling.  Finally,  a  significant  body 
of  work  has  been  accomplished  in  the  area  of  autonomous  navigation  with  the  construction  of  an 
autonomous  mobile  robot,  Robo-Tex,  as  a  testbed  for  navigation  algorithms,  as  well  as  a  number 
of  projects  on  position  estimation  and  calibration  techniques. 

Significant  research  findings  from  these  research  projects  have  been  presented  at  national 
and international conferences and published in refereed journals. A total of 13 graduate and 4
undergraduate  students  were  supported  under  this  contract,  and  6  Ph.D.  and  1  M.S.  degrees  were 
completed  during  the  contract  term. 

STATEMENT OF THE PROBLEM STUDIED.


Machine  vision — the  automated  interpretation  of  sequences  of  images  to  detect  and  rec¬ 
ognize,  locate  and  track  objects — has  many  important  applications  in  the  peace-keeping  activities 
of  the  Department  of  Defense,  including  automated  surveillance  and  monitoring,  autonomous 
navigation  for  smart  weapons,  and  industrial  robotics.  For  machine  vision  systems  to  be  truly 
useful,  information  must  be  extracted  quickly  from  digitized  images.  Extraction  is  a  complex 
task:  large  amounts  of  data  must  be  processed,  noise  is  present  in  the  images,  information  may 
be  incomplete,  or  the  models  of  the  scene  and  sensors  may  be  inadequate.  To  establish  useful 
and  practical  methods  for  machine  perception  of  targets  and  guidance  of  payloads,  a  broad  pro¬ 
gram  of  research  in  machine  vision  is  needed. 

Previous  research  in  machine  perception  focused  mainly  on  the  use  of  a  single  sensing 
modality,  for  example,  a  video  or  an  infrared  camera.  However,  most  single  sensor  systems 
work  only  in  highly  constrained  environments  and  require  massive  computational  resources. 
These  limitations  can  be  overcome  by  using  multiple  sensing  modalities  and  developing 
“intelligent”  algorithms  to  effectively  combine  these  modalities. 

In  a  broad  program  of  research  in  machine  vision  at  The  University  of  Texas  at  Austin, 
we  have  developed  an  approach  in  which  diverse  sensing  modalities  are  synergistically  com¬ 
bined.  The  synergistic  fusion  of  information  from  multiple  sensors  can  discern  additional  fea¬ 
tures  that  provide  better  discrimination  than  can  be  obtained  by  processing  the  sensor  inputs  sep¬ 
arately.  Our  work  has  focused  upon  building  physical  models  of  the  scene  in  order  to  relate  the 
signals  obtained  from  different  modalities  to  the  various  parameters  of  objects  in  the  scene. 
Based  on  these  estimated  parameters,  we  can  identify  and  evaluate  intrinsic  properties  of  the  ob¬ 
jects  in  order  to  interpret  the  scene. 

Four  broad  areas  of  inquiry  were  pursued  under  this  contract,  as  outlined  below: 

(1)  Outdoor  Scene  Interpretation  via  the  Fusion  of  Multiple  Imaging  Modalities 

(2)  Motion  Computation  and  Object  Recognition  Using  Range  Images 

(3)  Structure  and  Identity  Based  on  Color  and  Shape  Information 

(4)  Autonomous  Navigation 

Significant  results  from  the  projects  in  these  areas  of  inquiry  are  summarized  below. 

SUMMARY OF IMPORTANT RESULTS.


The research projects carried out under this contract have led to significant progress in the
study of some of the most difficult problems in computer vision, including multisensor systems
for outdoor scene interpretation, algorithms for motion computation and object recognition in
cluttered environments using range images, techniques for computing structure and identity
based on color and shape data, and the development of a mobile robot, Robo-Tex, as a testbed
for autonomous navigation algorithms.

1. Outdoor Scene Interpretation via the Fusion of Multiple Imaging Modalities.

a.  Automatic  Interpretation  Using  Multiple  Sensors. 

The AIMS (automatic interpretation using multiple sensors) knowledge-based
system was developed to interpret registered laser radar and thermal images to detect and
recognize man-made objects, such as armored personnel carriers and trucks, in outdoor rural scenes
with a background of vegetation, ground, and sky [1-6]. The system applies the multisensor
fusion approach to multiple ladar modalities to improve both segmentation and interpretation. An
early version of the system, using laser radar and thermal images, was presented at the 7th IEEE
Conference on Artificial Intelligence Applications (1991), where it received the conference's
outstanding paper award [1]. The system was later expanded to four sensing modalities (range,
intensity, velocity, and thermal) to improve image segmentation and interpretation.

The  interpretation  system  developed  [2]  is  not  limited  to  the  domain  of  object  recogni¬ 
tion,  and  could  also  be  used  for  robot  navigation  and  obstacle  avoidance.  The  design  of  the  rule 
bases  is  modularized  to  permit  future  expansion  of  the  system  to  incorporate  additional  sensing 
modalities.  The  system  applies  the  multisensor  fusion  approach  to  multiple  ladar  modalities  to 
improve  both  segmentation  (pixel-level  sensor  fusion)  and  interpretation  (object-level  sensor  fu¬ 
sion).  This  approach  offers  the  dual  advantages  of  (1)  the  ability  to  work  under  less  than  optimal 
imaging  environments  (rain,  night,  etc.)  and  (2)  the  ability  to  detect  objects  and  estimate  their  at¬ 
tributes  with  better  precision. 

The  use  of  different  sensors  provides  not  only  different  types  of  information,  but  also 
multiple  observations  of  the  same  information  through  different  channels.  The  knowledge-based 
interpretation system is constructed using KEE and Lisp. Low-level attributes of image
segments (regions) are computed by the segmentation modules and then converted to the KEE
format. The interpretation system applies forward chaining in a bottom-up fashion to derive
object-level interpretation from input generated by low-level processing and segmentation modules.
The interpretation modules distinguish man-made objects from the background using low-level
attributes. Segments are grouped into objects, which are then classified into predefined categories
(vehicles, ground, etc.). The efficiency of the interpretation system is enhanced by transferring
nonsymbolic processing tasks to a concurrent service manager (program).
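
The bottom-up forward chaining described above can be illustrated with a minimal sketch. The rule names, attribute labels, and the `forward_chain` helper are hypothetical and are not taken from the KEE rule bases of the AIMS system; the sketch only shows how facts about a segment are propagated until no rule adds anything new.

```python
# Illustrative forward chaining over symbolic segment attributes.
# All rules and attribute names below are hypothetical examples.

def forward_chain(facts, rules):
    """Fire rules repeatedly until no rule adds a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

RULES = [
    (frozenset({"hot", "planar"}), "man-made-segment"),
    (frozenset({"man-made-segment", "moving"}), "vehicle"),
    (frozenset({"cold", "textured"}), "vegetation"),
]

# Low-level attributes of one segment, as a segmentation module might report them.
result = forward_chain({"hot", "planar", "moving"}, RULES)
```

In this toy run the first rule fires on the low-level attributes and its conclusion enables the second, which mirrors the derivation of object-level labels from segment-level input.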

The AIMS system is based upon a new integration algorithm that integrates multiple
region segmentation maps and edge maps [5]. It operates independently of image sources and
specific region-segmentation or edge-detection techniques. User-specified weights and the
arbitrary mixing of region/edge maps are allowed. The integration algorithm enables multiple edge
detection/region segmentation modules to work in parallel as front ends. The solution procedure
consists of three steps. A maximum likelihood estimator provides initial solutions to the
positions of edge pixels from various inputs. An iterative procedure using only local information
(without edge tracing) then minimizes the contour curvature. Finally, regions are merged to
guarantee that each region is large and compact. The channel-resolution width controls the
spatial scope of the initial estimation and contour smoothing to facilitate multiscale processing.
Experimental  results  are  demonstrated  using  data  from  different  types  of  sensors  and  processing 
techniques.  The  results  show  an  improvement  over  individual  inputs  and  a  strong  resemblance  to 
human-generated  segmentation. 
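
The idea of forming a consensus from several weighted edge maps can be sketched as follows. This is a simplified stand-in that assumes a plain weighted vote per pixel; it is not the maximum likelihood estimator, curvature minimization, or region merging of [5], and the function name and threshold are hypothetical.

```python
import numpy as np

def consensus_edges(edge_maps, weights, threshold=0.5):
    """Weighted per-pixel vote over binary edge maps (illustrative only)."""
    maps = np.stack([np.asarray(m, dtype=float) for m in edge_maps])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize user-specified weights
    support = np.tensordot(w, maps, axes=1)  # weighted fraction of inputs marking an edge
    return support > threshold

# Three hypothetical 1x4 edge maps from different front-end modules.
a = [[1, 1, 0, 0]]
b = [[1, 0, 0, 0]]
c = [[1, 1, 1, 0]]

equal = consensus_edges([a, b, c], weights=[1, 1, 1])
favor_c = consensus_edges([a, b, c], weights=[1, 1, 3])
```

With equal weights the third pixel is rejected because only one input supports it; raising the weight of the third map lets its evidence dominate, which is the role the user-specified weights play in the text.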

The AIMS system has been ported to an AT&T Pixel Machine to study the behavior of a
rule-based system for image understanding in a multiprocessor environment and to study the
hardware and software requirements for such an implementation [6]. The AT&T Pixel Machine
employs a distributed memory computer architecture with message passing. Past research efforts
in parallel processing of rule-based systems used the fine-grain Rete algorithm on less powerful
processors. However, with the increased computing power and decreased cost of new
multiprocessor hardware, new software strategies are needed. The system generates a rule firing rate of
700+ rules/second and has a linear speedup with respect to the number of processors, on a
platform designed for graphics rendering. The experiment shows that given strong data locality, a
distributed memory architecture can support complex software systems, including rule-based
systems, without using fine-grained parallelism.

b.  Locating  Man-made  Objects  in  Outdoor  Scenes. 

A new approach for the detection of large man-made objects was developed using
perceptual organization techniques to group low-level features in a scene to determine a region of
interest that is most likely to contain man-made objects [7-9]. In this study, the man-made
objects may be unspecified and the appearance of the objects is unpredictable. The approach
applies the principles of perceptual organization and makes use of prominent features that
distinguish man-made objects from natural objects. Using computer vision techniques such as feature
extraction, primitive structure formation, and segmentation, we hierarchically group low-level
image features into a region of interest, an area which is deemed most likely to enclose
man-made objects or a substantial part of a man-made object. A paper based upon this work was
published in Pattern Recognition, and received the Honorable Mention of the Pattern
Recognition Society Award for Outstanding Contribution, November 1993.

The goal of our research is to detect man-made objects in images of natural scenes.
Since the objects are not specified in advance, features must be found that distinguish man-made
objects from natural objects in an image. Two of the most prominent characteristics of man-made
objects are the apparent regularity and relationship of their components. Most man-made objects
have linear structures or linear boundaries that form certain regular patterns, such as rectangles,
parallels and polygons. These regular patterns are usually related to each other and form the
man-made objects. After line detection, much of this regularity and relationship remains
apparent. To detect the man-made objects, we must extract the geometric structures that exhibit
regularity and relationship from the image. Hence, the framework of our approach includes three
phases: (1) extracting image features, (2) finding regularities and relationships among these
features, and (3) identifying the region occupied by the related regular structures.

The  first  level  of  grouping  extracts  image  features.  We  currently  consider  two  kinds  of 
features:  linear  structures  (LS)  and  coterminations  (CT).  A  linear  structure  in  an  image  is  a 
representation  of  a  set  of  approximately  collinear  line  segments  which  are  close  and  likely  to 
come  from  the  same  linear  structure  in  the  scene.  Extracting  LS  reflects  the  proximity, 
collinearity,  and  continuation  properties  of  perceptual  grouping.  A  cotermination  is  a  set  of  lines 
terminating  at  a  common  point  or  a  small  common  region.  The  cotermination  is  an  important 
relation.  Cotermination  is  a  non-accidental  relationship  and,  hence,  reflects  significant  structural 
information.  It  is  also  view  invariant  in  a  wide  range  of  viewpoints  and  can  be  used  for  3D 
inference.  The  CT  are  represented  by  a  graph  called  the  CT  graph. 
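
A cotermination graph of the kind described above can be sketched in a few lines. The data layout (segments as pairs of endpoints), the `radius` tolerance, and the function name are assumptions made for illustration; the report does not specify how the CT graph is stored or which tolerance is used.

```python
import math
from itertools import combinations

def cotermination_graph(segments, radius=2.0):
    """Link segments whose endpoints fall within a small common region.

    segments: list of ((x1, y1), (x2, y2)) endpoint pairs.
    Returns {segment_index: set of coterminating segment indices}.
    """
    graph = {i: set() for i in range(len(segments))}
    for i, j in combinations(range(len(segments)), 2):
        # Two segments coterminate if any pair of their endpoints is close enough.
        if any(math.dist(p, q) <= radius for p in segments[i] for q in segments[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph

segments = [
    ((0, 0), (10, 0)),     # horizontal line
    ((10, 1), (10, 10)),   # vertical line ending near (10, 0): forms a corner
    ((50, 50), (60, 50)),  # isolated line
]
ct = cotermination_graph(segments)
```

Here the first two segments form a corner and are linked, while the third remains an isolated node of the graph.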

The second level grouping process organizes features that exhibit regularity and
relationship into larger structures called primitive structures (PS). We consider two kinds of PS:
parallel PS and polygon PS. A parallel PS is a set of parallel lines satisfying certain conditions.
A polygon PS is a closed figure that consists of line segments and satisfies certain criteria.
Parallel and polygon PS are higher level structures than lines and coterminations. The PS are
represented by a graph called the PS graph that describes the spatial relationships among the PS.
Such a graph facilitates higher level processing. Using the PS graph, the third phase of the
framework groups spatially close PS, eliminates the isolated ones, and segments the image into
regions occupied by the grouped PS and a background. The largest region of the grouped PS is
then evaluated based on the area of the region and the statistics of the PS. If this region is
determined to be significant, it is most likely to enclose man-made objects or a substantial part of
the man-made objects, and thus is considered to be the region of interest (ROI). Figure 2
illustrates the overall data representation, relationship, and flow in this framework.


The  study  used  monochrome  images  containing  man-made  objects  such  as  bridges  and 
electric  transmission  towers  in  complex  backgrounds.  By  locating  the  region  of  interest  in  the 
image,  the  search  space  is  substantially  reduced  from  the  whole  image  to  that  region.  This 
technique  could  be  useful  for  screening  a  large  number  of  images  for  automatic  object 
recognition  or  for  a  human-machine  system.  For  an  automatic  system,  when  specific  object 
classes  are  given  and  models  are  established,  the  primitive  structures  composing  the  region  of 
interest  can  be  matched  to  object  models  rather  than  to  individual  features.  This  will 
considerably  reduce  the  search  space  for  matching,  since  more  constraints  are  applied.  For  a 
human-machine  system,  the  region  of  interest  can  be  used  as  a  focus-of-attention  for  human 
expertise  to  further  examine  the  image. 

c. Modeling Non-Homogeneous 3-D Objects for Thermal and Visual Image Synthesis.

An approach to the integrated modeling of 3-dimensional objects was developed
that supports the synthesis of visual and thermal images under different viewing, ambient, and
internal conditions [10]. Object modeling is accomplished using the volume surface octree, a
representation which is well suited for thermal modeling of complex objects with
non-homogeneities and heat generation. An improved technique for constructing the volume surface octree
increases storage efficiency without degrading the quality of the image synthesis. A technique is
used to incorporate non-homogeneities and heat generation using octree intersection. A
computationally efficient, implicit finite difference method is used to simulate heat flow of objects with a
large number of octree nodes and with non-homogeneities. The model is designed to be used in
a multisensor vision system that would use the images and features predicted by this model in a
hypothesize-and-verify scheme.
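
To illustrate what an implicit finite difference step looks like, the following minimal sketch advances a one-dimensional rod with non-homogeneous conductivity and an optional heat source by one backward Euler step. The one-dimensional grid, unit heat capacity, insulated ends, and all names are simplifying assumptions; the method of [10] operates on octree nodes in three dimensions.

```python
import numpy as np

def implicit_heat_step(u, k, q, dt, dx=1.0):
    """One backward Euler step of u_t = (k u_x)_x + q with insulated ends."""
    n = len(u)
    k_face = 0.5 * (k[:-1] + k[1:])   # conductivity at the n-1 faces between cells
    r = dt / dx**2
    A = np.eye(n)
    for i in range(n - 1):
        # Each face couples cell i and cell i+1 symmetrically.
        A[i, i] += r * k_face[i]
        A[i, i + 1] -= r * k_face[i]
        A[i + 1, i + 1] += r * k_face[i]
        A[i + 1, i] -= r * k_face[i]
    # Implicit scheme: solve A u_new = u_old + dt * q.
    return np.linalg.solve(A, u + dt * q)

u0 = np.array([100.0, 0.0, 0.0, 0.0, 0.0])   # hot cell at one end
k = np.array([1.0, 1.0, 0.1, 0.1, 0.1])      # two materials along the rod
q = np.zeros(5)                               # no internal heat generation
u1 = implicit_heat_step(u0, k, q, dt=10.0)    # large step, still stable
```

Because the unknown temperatures appear on the left-hand side, the step remains stable for large time steps, which is what makes implicit schemes attractive when the number of nodes is large.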

d.  Object  Recognition  Using  ART-2  Artificial  Neural  Network. 

The Hybrid Architecture for Man Made Object Recognition (HAMMER) is an
object recognition system that uses the Adaptive Resonance Theory (ART)-based ART-2
artificial neural network [11]. ART networks are one of the most promising neural network
architectures for image recognition applications. The goal of this object recognition system is to
recognize and classify man-made moving objects from a sequence of 2D images. The objects appear
in natural scenes containing trees, buildings, landscapes, etc. As shown in Figure 3, the
HAMMER system architecture incorporates a preprocessing module that extracts invariant
features from the input image, which are then used by the neural network for object classification
and recognition. The preprocessing module consists of two stages. In the first stage, the object
(image figure) to be recognized is segmented from the image background. This is accomplished
by determining the Region of Interest (ROI) in the image. In the second stage, a transformation
is applied to the image, whose output spectra are invariant under image transformations such as
2D rotation, scaling and translation. The features extracted from this final preprocessing stage
are then input to ART-2 for unsupervised classification. Based on the output of the ART-2
module, the DECISION LAYER then labels (i.e., names) the input object. The HAMMER
system is implemented in C++ on a PC 486 and was tested using images of different types of
military vehicles in natural surroundings. The HAMMER system correctly classified about 90%
of the input objects. Overall, this system is capable of performing successfully in a complex,
unknown environment.
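
The report does not name the invariant transformation used in the second preprocessing stage. As a hedged illustration of how a spectrum can be invariant to an image transformation, the sketch below uses the magnitude of the 2D discrete Fourier transform, which is unchanged by circular translation. Invariance to rotation and scaling would need further steps (for example a log-polar resampling), which are not shown; the function name is hypothetical.

```python
import numpy as np

def translation_invariant_features(image):
    """Magnitude of the 2D DFT; unchanged by circular shifts of the image."""
    return np.abs(np.fft.fft2(image))

# A small synthetic "object" and a circularly shifted copy of it.
img = np.zeros((16, 16))
img[2:5, 3:9] = 1.0
shifted = np.roll(img, shift=(4, 5), axis=(0, 1))

f_original = translation_invariant_features(img)
f_shifted = translation_invariant_features(shifted)
```

A shift only changes the phase of each Fourier coefficient, so the two feature arrays agree even though the pixel arrays differ.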


Figure  3.  The  HAMMER  system 

2. Motion Computation and Object Recognition Using Range Images.

Computer  vision  techniques  play  a  significant  role  in  a  number  of  fields  of  engi¬ 
neering  such  as  robotics  and  manufacturing.  In  most  applications  it  is  important  to  sense  the  en¬ 
vironment  and  estimate  the  relative  motion  between  the  sensor  and  the  objects  in  the  environ¬ 
ment.  For  example,  when  a  robot  moves  in  a  work  space  it  is  important  for  the  robot  to  know  its 
motion  relative  to  the  other  objects  in  the  work  space. 

Traditionally, many vision systems have used a video camera as the primary sensor.
However, video images provide no depth information about the scene. Human beings have
remarkable vision systems that can extract 3-D information from 2-D images, but, at the present
state of technology, it is impossible for a computer to achieve the same level of performance.
This shortcoming can be overcome by using range image sensors in the computer
vision system. A range sensor directly obtains the 3-D depth information of the scene. This
information (called a 3-D range image) can then be used by the computer to reason about the
environment and accomplish complex tasks.

a. Surface Correspondence and Motion Computation from a Sequence of Range Images.

We have addressed the problem of determining the motion of the range sensor as
the sensor moves relative to the objects in the scene [12-14]. The motion must be estimated
from a 3-D range image sequence obtained from the range sensor. The key to solving this
problem is identifying significant features in the images and using those features to compute the
motion transformation. The range sensor senses the points on the object surface and generates a 3-D
map of the scene. In such an imaging modality it is logical to use the object surfaces as the
features for the motion computation task. We have developed surface-based image
processing techniques that are used in high level vision tasks.

The  object  surfaces  are  extracted  from  each  3-D  range  image.  The  next  task  is  to  track 
each  surface  segment  over  the  image  frames  in  the  sequence.  This  is  a  fairly  complex  task,  and 
to  add  to  the  complexity,  the  surfaces  may  become  hidden  by  other  surfaces  as  the  sequence  pro¬ 
gresses,  or  new  surfaces  that  were  hidden  earlier  may  appear  in  the  images.  The  vision  system 
developed  at  our  research  center  handles  such  occurrences  and  reliably  arrives  at  the  solution. 

The system uses the geometry and the topology of the scene to establish the
correspondence between surfaces in different frames of the image sequence. This task is facilitated by
the use of a representation scheme based on a hypergraph. After establishing correspondence
between surface segments in the image frames, the next task is to compute the motion
transformation. The computation of rigid motion transformation is a nonlinear problem to which
the solution is not simple. If the surfaces are planar the motion computation task is significantly
simplified, and the rotations and the translations can be estimated with reasonable certainty.
However, in the case of the next order of surfaces, i.e., quadric surfaces, the motion computation
equations become intractable. There is no guarantee that the solution obtained is the correct and
optimal solution. We solve this problem by appealing to the geometry of the quadric surfaces. It
is known that every quadric has a unique linear feature. We extract this linear feature and
use it to compute motion. Our approach uses object surfaces as the basis of all the tasks involved
in motion computation. This results in reliable estimates of the motion transformation and the
computation of correspondences. The procedure to compute motion using the linear features of
quadric surfaces is novel and overcomes the problems faced by earlier approaches that
formulated complex non-linear optimization problems to compute the motion transformation.
This approach can also be extended to other higher order surfaces.
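
For the planar case mentioned above, one standard way to recover a rigid motion from corresponding planes is sketched below. It is a generic SVD-based (Kabsch-style) least-squares estimate and is not claimed to be the procedure of [12-14]. Each plane is written as n . x = d; under x' = R x + t the normal becomes n' = R n and the offset becomes d' = d + n' . t. All function and variable names are illustrative.

```python
import numpy as np

def motion_from_planes(n_before, d_before, n_after, d_after):
    """Estimate rotation R and translation t from matched planes n . x = d.

    n_before, n_after: (m, 3) arrays of unit normals (rows correspond).
    d_before, d_after: (m,) arrays of plane offsets.
    """
    # Rotation: least-squares alignment of the normal directions.
    H = n_before.T @ n_after
    U, _, Vt = np.linalg.svd(H)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    # Translation: each plane gives one linear equation n' . t = d' - d.
    t = np.linalg.lstsq(n_after, d_after - d_before, rcond=None)[0]
    return R, t

# Synthetic check: rotate 30 degrees about z and translate.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
normals = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]])
offsets = np.array([1.0, 2.0, 3.0, 4.0])
normals_after = normals @ R_true.T
offsets_after = offsets + normals_after @ t_true

R_est, t_est = motion_from_planes(normals, offsets, normals_after, offsets_after)
```

At least three planes with linearly independent normals are needed for the translation to be determined; with fewer, the least-squares solution is not unique.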


b.  Segmentation  of  3D  Range  Images  Using  Pyramidal  Data  Structures. 


The first step in most computer vision systems is to represent the acquired data
using symbolic descriptions. Such descriptions are then used to perform higher level vision
tasks, such as object recognition, motion estimation, or navigation. To obtain the symbolic
descriptions, the input data must be partitioned, or segmented, into a set of primitives. The
segmentation depends on the nature of the input data and on the primitives used by the high level
tasks. We have addressed the problem of segmenting data that represent the 3D coordinates of
each point in the scene (i.e., dense range images). Specifically, the problem is stated as: Given a
3D range image of a scene containing multiple arbitrarily shaped objects, segment the scene into
homogeneous surface patches. We have proposed a new modular framework for this
segmentation task [15]. The segmentation task is addressed within a framework of a vision system in
which the output of the segmentation module is not the final objective. Instead, the segmentation
procedure output must be capable of interpretation by the higher level modules. The high level
vision tasks dictate to the segmentation module the criteria for uniformity of regions and the
representation of the output.

The  framework  consists  of  two  independent  modules,  the  first  of  which  performs  low 
level  segmentation  and  the  second  carries  out  the  subsequent  merging  of  regions.  The  segmen¬ 
tation  is  accomplished  by  an  iterative  pyramidal  clustering  scheme  using  zeroth  and  first-order 
local  surface  properties  at  each  point  of  the  scene.  The  segmentation  is  then  refined  in  the  sec¬ 
ond  module  using  high  order  surface  representations  dictated  by  the  upper  level  vision  tasks. 
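
As an illustration of zeroth and first-order local surface properties, the sketch below takes depth as the zeroth-order property and the unit surface normal, estimated from finite-difference gradients, as the first-order property. It assumes a range image stored as a regular grid of depths z(x, y) with unit pixel spacing; the choice of estimator and the function name are assumptions, not details taken from [15].

```python
import numpy as np

def local_surface_properties(z):
    """Return depth and unit surface normals for a gridded range image z[y, x]."""
    dz_dy, dz_dx = np.gradient(z)            # derivatives along rows (y) and columns (x)
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(z)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return z, normals

# Synthetic planar patch z = 2x + 3y + 1: every pixel should share one normal.
yy, xx = np.mgrid[0:5, 0:6]
z = 2.0 * xx + 3.0 * yy + 1.0
depth, normals = local_surface_properties(z)
expected = np.array([-2.0, -3.0, 1.0]) / np.sqrt(14.0)
```

Pixels with similar depth and similar normal direction are natural candidates for being clustered into the same low-level surface patch.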

The procedure has been applied successfully to many range images obtained from various
institutions. This procedure offers a number of advantages over existing segmentation
procedures. By using the modular framework, the low level segmentation process is independent of
the surface type and description, and the high level process is independent of the local properties
derived from the input data as well as of the method used to achieve the oversegmentation.
Further, no restrictions are placed on the type or size of objects in the scene. Unlike most
existing segmentation schemes, the procedure's dependency upon empirically determined thresholds
is minimal. Finally, the pyramidal algorithms may be implemented in parallel.


c. The Convergence of Fuzzy Pyramid Algorithms.


Pyramid linking is an important technique for segmenting images and has many
applications in image processing and computer vision. The algorithm is closely related to the
ISODATA clustering algorithm and shares some of its properties. We have investigated this
relationship and developed a proof of convergence for the pyramid linking algorithm [16]. The
convergence of the "hard" pyramid linking algorithm has been shown in the past; however, there
has been no proof of the convergence of "fuzzy" pyramid linking algorithms. The proof of
convergence is based on Zangwill's theorem, which describes the convergence of an iterative
algorithm in terms of a "descent function" of the algorithm. We show the existence of such a descent
function on the pyramid algorithm, and demonstrate that all the conditions of Zangwill's theorem
are met; hence, the algorithm converges.
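
The notion of a descent function can be illustrated numerically with fuzzy c-means, a fuzzy relative of ISODATA. This is not the pyramid linking algorithm and not the proof of [16]; it is a small sketch, with hypothetical names and data, showing an objective that does not increase from one iteration to the next, which is the property a descent function must have.

```python
import numpy as np

def fuzzy_c_means(x, centers, m=2.0, iters=20):
    """1-D fuzzy c-means; returns final centers and the objective per iteration."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    history = []
    for _ in range(iters):
        # Squared distances from each center (rows) to each point (columns).
        d2 = (x[None, :] - centers[:, None]) ** 2 + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)                   # memberships sum to 1 per point
        centers = (u**m @ x) / (u**m).sum(axis=1)   # membership-weighted means
        d2_new = (x[None, :] - centers[:, None]) ** 2
        history.append(float((u**m * d2_new).sum()))  # candidate descent function
    return centers, history

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers, history = fuzzy_c_means(data, centers=[1.0, 4.0])
```

Each half-step (updating memberships for fixed centers, then centers for fixed memberships) minimizes the same objective over one block of variables, so the recorded values form a non-increasing sequence.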


d.  CAD-Based  Object  Recognition. 


We have addressed the problem of CAD-based object recognition, in which the
objective is to use a three-dimensional CAD model of an object to locate that object in a scene
containing several overlapping objects, arbitrarily positioned and oriented [17-20]. A laser range
scanner is used to collect 3D data points from the scene. The collected data is segmented into
surface patches, and the segments are used to calculate various 3D surface properties. CAD
models are designed using the commercially available CADKEY system and accessed via the
industry standard IGES.
The models are analyzed off-line to derive various geometric features, their relationships, and
their attributes. A strategy for identifying each model is then automatically generated and stored.
The strategy is applied at run-time to complete the task of object recognition. The goal of the
generated strategy is to select the model's geometric features in the sequence best
suited to identifying and locating the model in the scene. The generated strategy is guided by several
factors, including the visibility, detectability, frequency of occurrence and topology of the
features.

This object recognition system differs significantly from previous systems, in that it uses
a commercial 3D CAD system and IGES interface. Using CADKEY and IGES decreases the
dependency of the vision system on any particular CAD modeler and increases its applicability.
Moreover, the model description, derived automatically from IGES, is used to systematically
derive a matching strategy from a geometric model. By using the recognition strategy, it is not
necessary to consider all the possible matching combinations of sensory features and model
features, increasing the efficiency of the system. Precompiling the recognition strategy also
increases the vision system's run-time efficiency, since less time is spent on model analysis during
task execution. Finally, the matching strategy is not significantly dependent on moment-based
and boundary-based features, and, unlike many previous approaches to object recognition, the
system does not require an unconditional one-to-one matching of sensory features and model
features. Thus, the recognition system is not sensitive to the partial occlusion of objects, and the
oversegmentation of surface patches is easily tolerated.
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3.  Structure  and  Identity  Based  on  Color  and  Shape  Information 

a. Integration of Image Segmentation Maps Using Region and Edge Information.

As  mentioned  briefly  above  in  Section  1(a),  an  algorithm  has  been  developed  for 
the  AIMS  interpretation  system  that  integrates  image  segmentation  maps  using  region  and  edge 
information without the intervention of high-level knowledge [5]. One problem encountered in
using  multiple  sensing  modalities  is  that  when  different  segmentation  techniques  are  applied  to 
the  images  obtained  from  different  modalities,  different  segmentation  maps  are  generated.  One 
has  to  resolve  the  differences  between  all  such  segmentation  maps  to  benefit  from  the  rich 
information provided by various sources. Information integration is a suitable approach to
enhance system performance by verifying cues from one source against another. It is also necessary
because  of  significant  information  loss  during  the  image  acquisition  process.  Information 
integration  improves  the  signal-to-noise  ratio,  because  information  that  is  consistent  among 
different  sources  is  reinforced,  while  information  that  is  contradicted  is  attenuated. 

This algorithm integrates segmentations from different sensing modalities, segmentation
techniques, and control parameters. It operates independently of the problem domain, the
segmentation techniques, and any particular combination of edge and region maps. The basic task
of the algorithm is estimation: generating a consensus of the true underlying segmentation from
multiple observations. Further, the algorithm allows the flexibility of user-specified weights on
different information sources, since they may not be equally reliable.

We assume that a true region contour map exists as the signal source, but that the signal
is contaminated by noise during image acquisition, preprocessing, and segmentation. The
objective of the integration module is to recover the original contour map from multiple contaminated
copies using minimal knowledge about the signal and noise sources.
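
The consensus idea described above can be illustrated with a deliberately simplified sketch. It is not the estimator used in this work: it assumes the inputs are already registered binary edge maps and fuses them by a weighted per-pixel vote, ignoring the continuity-aware averaging of contour positions. The function name, the threshold parameter, and the sample data are hypothetical.

```python
import numpy as np

def consensus_edge_map(edge_maps, weights, threshold=0.5):
    """Fuse binary edge maps of identical shape by a weighted per-pixel vote.

    Returns (support, consensus): support is the weighted fraction of sources
    that mark each pixel as an edge; consensus is support >= threshold.
    """
    maps = np.stack([np.asarray(m, dtype=float) for m in edge_maps])
    w = np.asarray(weights, dtype=float)
    if w.shape != (maps.shape[0],):
        raise ValueError("need exactly one weight per edge map")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    w = w / w.sum()                          # user-specified reliabilities
    support = np.tensordot(w, maps, axes=1)  # weighted average over sources
    return support, support >= threshold

# Three hypothetical 1x3 edge maps; the third source is trusted twice as much.
maps = [[[1, 1, 0]], [[1, 0, 0]], [[1, 1, 1]]]
support, consensus = consensus_edge_map(maps, weights=[1, 1, 2], threshold=0.6)
```

Edges on which the sources agree receive high support and are reinforced, while an edge reported by a single source falls below the threshold and is attenuated.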

The work uses only the contour (position and length), the size, and the neighboring
relationship attributes of regions. Other regional information is not used, because it would require
the integration module to know about the construction of the front-end (region-growing-based)
segmentation algorithms and about their operational results before a contour is generated. As
shown in Figure 4, the integration procedure consists of three stages: (1) initial estimation,
which estimates the true region contours from the given multiple observations; (2) contour
smoothing, which directly reduces curvature; and (3) constraint satisfaction, which satisfies
additional nonnegotiable constraints on the integration output according to the application
context. The integration problem, therefore, is formulated as follows: given several sets of edge
pixels and associated weights, solve for another set of edge pixels representing all input sets and
exhibiting certain properties. Our solution procedure decomposes the original formulation into
three stages to pursue a suboptimal solution. The first stage is an estimator that generates an
initial solution, while ensuring that the solution is a negotiated result from all the inputs and
that the output is a sufficient representative of the input data. The second stage employs a
potential-energy model whose iterative minimization smooths the contour. The third stage checks
the nonnegotiable constraints and merges regions that violate such constraints.

The work uses maximum likelihood estimation as the estimation strategy. A priori information
is not considered, since it is usually unavailable in practical situations. The algorithm addresses
the issue of figural continuity and incorporates this concern into the solution procedure. During
the initial estimation process, continuity influences which data points are considered in the
weighted average operation. After the initial estimation, the contour connectivity constraint
enforces connections between pixels that are originally connected. The curve smoothing stage uses
the spring-node model to refine node positions iteratively, subject to the finite-drift and
connected-neighbor constraints. The algorithm treats juncture and nonjuncture pixels uniformly,
and satisfies nonnegotiable constraints on region size and contour compactness if a region map is
desired.

Figure 4. System Overview - Integration of Image Segmentation
Maps Using Region and Edge Information.
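
The contour smoothing stage can be pictured with the following minimal sketch. It is only an illustration under stated assumptions, not the implementation used in this work: the contour is an open chain of nodes joined by springs, each interior node is pulled toward the midpoint of its two neighbors, and the finite-drift constraint is modeled as a cap on how far a node may move from its initial position. The parameter names (stiffness, max_drift, iterations) are hypothetical.

```python
import numpy as np

def smooth_contour(nodes, stiffness=0.5, max_drift=2.0, iterations=20):
    """Relax an open contour with a simple spring-node model.

    Endpoints stay fixed. Each interior node moves a fraction `stiffness`
    of the way toward the midpoint of its neighbors, and its total
    displacement from the original position is clamped to `max_drift`.
    """
    original = np.asarray(nodes, dtype=float)
    current = original.copy()
    for _ in range(iterations):
        midpoints = 0.5 * (current[:-2] + current[2:])
        proposed = current.copy()
        proposed[1:-1] += stiffness * (midpoints - current[1:-1])
        # Finite-drift constraint: limit displacement from the original node.
        offset = proposed - original
        dist = np.linalg.norm(offset, axis=1, keepdims=True)
        scale = np.minimum(1.0, max_drift / np.maximum(dist, 1e-12))
        current = original + offset * scale
    return current

def spring_energy(contour):
    """Sum of squared lengths of the springs between consecutive nodes."""
    c = np.asarray(contour, dtype=float)
    return float(np.sum(np.diff(c, axis=0) ** 2))

zigzag = [(0, 0), (1, 1), (2, -1), (3, 1), (4, 0)]
smoothed = smooth_contour(zigzag)
```

Because each update uses only a node and its two neighbors, the computation is local, which is the property that makes this kind of stage suitable for parallel hardware.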


This algorithm is not another segmentation algorithm, but an integration algorithm
capable of using all established segmentation algorithms as front ends. Such an integration
algorithm allows many fast (but not necessarily intelligent) segmentation algorithms to be used in
parallel to achieve fast segmentation, and their results to be fed into the integration module. For
example, multiple shape-from-X techniques can operate in parallel and then have their results
integrated. Consequently, the need to design a single superior algorithm for segmentation is
much less critical. Since only local information is used in the initial estimation and in the
contour smoothing stages, the algorithm may be ported to parallel/distributed computing hardware
when maximum operating speed is required.

b.  Extraction  and  Interpretation  of  Semantically  Significant  Line  Segments  for  a 
Mobile  Robot. 

We have developed a new approach to extracting important line segments from
monocular images in order to estimate the position of important objects in the path of an
autonomous robot [21-24]. A paper based upon this work received the Phillips Award for Best
Paper at the IEEE Computer Society International Conference on Robotics and Automation,
Nice, France (1992). All stages of image interpretation, including the lowest processing level,
are designed to provide the higher stages with the most semantically useful features.

The robot's tasks are usually specified in high-level semantic terms, such as "go down the
hallway and go through the last door on the left." In order to execute this task, the robot must be
able to identify the objects of interest (hallways and doors) in its perception of the
environment. One approach is to reconstruct 3D line segment descriptions of the environment from
several intensity images, which are then grouped and matched to selected object models. For this,
several 3D segments are selected according to orientation and position criteria. The basic idea is
that the architecture in most indoor scenes contains edges with particular orientations in 3D, such
as vertical and two horizontal orientations perpendicular to each other. In a 2D perspective
projection, all the edges with a given 3D orientation appear to converge to a single point, called the
vanishing point (Figure 5). By precomputing the position of vanishing points in each image, it is
possible to find the most likely 3D orientation of observed edges, even from a single monocular
image. A 3D hypothesis of an object is generated and matched to the selected segments. The
process is repeated for other segments and other hypotheses. The high-level semantic
interpretation is then used to determine the free space or to find objects of interest.



Figure  5.  Perspective  Projection  of  a  Scene 


Since high-level interpreters are looking for line segments of particular orientations in
3D, we designed the lower-level image-processing stages to take advantage of this information.
This top-down information can benefit the feature extraction stage by reducing the number of
unwanted features, increasing sensitivity to good features, and drastically speeding the
computation. Preliminary results of this work were presented at the 1992 IEEE Intl. Conference on
Robotics and Automation, in a paper that received the Best Student Paper Award of the
conference [21].
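
The geometry behind this top-down filtering can be sketched as follows. This is an illustration only, not the line detector described in [21-24]: it assumes 2D segments are given as endpoint pairs, estimates a vanishing point as the least-squares intersection of their supporting lines in homogeneous coordinates, and scores a segment by the angle between it and the direction toward that vanishing point. All function names are hypothetical.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous coordinates of the image line through points p and q."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least-squares intersection of the lines supporting the segments.

    Returns a homogeneous point (x, y, w); w close to 0 means the
    vanishing point lies at infinity in the image plane.
    """
    lines = np.array([line_through(p, q) for p, q in segments])
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(lines)
    return vt[-1]

def angle_to_vanishing_point(segment, vp):
    """Angle in radians between a segment and the direction from its
    midpoint toward the vanishing point (0 means perfectly consistent)."""
    p = np.asarray(segment[0], dtype=float)
    q = np.asarray(segment[1], dtype=float)
    mid = 0.5 * (p + q)
    d_seg = q - p
    d_vp = vp[:2] - vp[2] * mid  # also valid for a point at infinity
    cosang = abs(d_seg @ d_vp) / (np.linalg.norm(d_seg) * np.linalg.norm(d_vp))
    return float(np.arccos(np.clip(cosang, 0.0, 1.0)))

# Three hypothetical segments whose supporting lines meet at (10, 5).
segs = [((0, 0), (5, 2.5)), ((0, 10), (5, 7.5)), ((0, 5), (4, 5))]
vp = vanishing_point(segs)
```

A segment whose angle is small for one of the precomputed vanishing points is a candidate for the corresponding 3D orientation; segments consistent with none of them can be discarded early.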


c.  3D  Structure  Reconstruction  from  an  Ego  Motion  Sequence  Using  Statistical 
Estimation  and  Detection  Theory. 

We have derived a decision-theoretical algorithm to estimate 3D structures from
extended sequences of 2D images taken by a moving camera [25]. We assume that the camera
motion is known and that the world is stationary. The 3D structures of interest are 3D lines,
because they are relatively stable and easy to extract from images. Traditionally, feature-based
motion analysis involves several separate operations: feature detection, feature matching,
structure/motion estimation, and higher level processing, such as feature grouping. Most of these
operations were originally designed to operate on just one or two images, and thus they did not
take advantage of having an extended sequence of images. The decision-theoretical algorithm,
however, uses statistical estimation and detection theory to integrate these operations. Statistical
estimation theory has been used extensively in sequential 3D structure reconstruction problems.
The basic idea is that if the 3D world remains stationary, then the unknown 3D structures, such
as depths or line parameters, can be considered as state variables; the changes in their values due
to the camera's ego-motion can be modeled by the system's dynamics, and the information
extracted from images is related to the unknowns via the so-called observation or measurement
models. The structure estimation can thus be posed as the state reconstruction problem in
dynamic system theory.

The novelty of our work is that statistical detection theory is used to model the other phases
of the operation, which provides a natural way to incorporate temporal information into those
phases and to integrate the whole system. This formulation is more robust against matching errors
because it represents the matching ambiguities by their probabilities and explicitly incorporates
the information into the estimation update. Feature tracking and parameter estimation are closely
coupled in this formulation. The recursive estimation algorithm provides the prediction used for
tracking features, and the tracking algorithm evaluates the probabilities used for the estimation
update. This algorithm could be easily modified to include other properties associated with lines,
such as length, midpoint, or contextual information associated with 2D edges, as long as they can
be represented in a parametric form. The algorithm could also be modified for other types of
features, such as points or curves, as long as they can be represented by a finite number of
parameters.
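
The state-reconstruction view can be made concrete with a generic recursive estimator. The sketch below is a standard linear Kalman measurement update, not the algorithm of [25]: it assumes the unknown structure is expressed in a fixed world frame (so the state itself does not change), that the known camera motion enters only through a per-image linear measurement matrix H, and that each measurement has already been matched to its feature. The probabilistic treatment of matching ambiguities is not modeled here.

```python
import numpy as np

def kalman_update(x, P, z, H, R):
    """One linear Kalman measurement update for state x with covariance P."""
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(x.size) - K @ H) @ P
    return x_new, P_new

def estimate_static_state(x0, P0, observations, R):
    """Fold a sequence of (H, z) observations into a stationary state."""
    x = np.asarray(x0, dtype=float)
    P = np.asarray(P0, dtype=float)
    for H, z in observations:
        H = np.atleast_2d(np.asarray(H, dtype=float))
        z = np.atleast_1d(np.asarray(z, dtype=float))
        x, P = kalman_update(x, P, z, H, R)
    return x, P

# Hypothetical example: a 2-parameter structure seen through three
# different (known) measurement geometries, one scalar reading per image.
x_true = np.array([2.0, -1.0])
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observations = [(h, np.dot(h, x_true)) for h in rows]
x_est, P_est = estimate_static_state(
    np.zeros(2), 100.0 * np.eye(2), observations, R=np.array([[1e-4]])
)
```

Each new image tightens the covariance, which is the benefit of using an extended sequence instead of one or two images. In a tracker, the predicted measurement H @ x and its covariance S would also define where to look for the feature in the next image.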

d. Line Correspondences from Cooperating Spatial and Temporal Grouping
Processes for a Sequence of Images.

Our group has developed a new algorithm for matching line segments based on
perceptual grouping and relaxation labeling [26]. We consider feature matching between two views
as a "temporal grouping" process, in addition to the traditional spatial groups established by
perceptual grouping in a single image. The relaxation labeling paradigm is used to integrate
spatial and temporal grouping. In the relaxation process, correspondence ambiguities are
resolved by iteratively propagating constraint information from the nodes of a line, which are
defined by the line's perceptual groups. The system is suitable for indoor or urban structured
environments. This line matching algorithm has three new developments: (1) rigorous
perceptual grouping processes based on the statistical inference paradigm; (2) generalization of
the perceptual grouping concept to the temporal domain; and (3) the cooperation of spatial and
temporal grouping processes using relaxation labeling techniques. We detect edges using the
Canny edge operator and then extract lines using the Object Recognition Toolkit. The algorithm
uses the relaxation labeling paradigm to match a set of lines in two images. Initially, a line in the
first image has multiple matching candidates in the second image. This ambiguity is resolved by
iteratively propagating constraint information from a line's neighbors, which are defined in the
iterative update process by the line's perceptual groups. Traditional perceptual grouping requires
heuristic threshold values. We define the grouping process using the statistical inference
paradigm. The advantage of this algorithm over previous work is that we simultaneously
hypothesize and test both temporal and spatial relations among 2D lines, allowing one relation to
be used as supporting evidence for the other.
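
The iterative update can be illustrated with the classical relaxation labeling rule. The sketch below is a textbook form of that rule, not the specific update of [26]: it assumes each line holds a probability vector over its candidate matches, that each line's neighbors come from its perceptual groups, and that a compatibility array with values in [-1, 1] is given. The array layout and names are hypothetical.

```python
import numpy as np

def neighbor_support(p, compat, neighbors):
    """Average support each candidate match receives from neighboring lines.

    p[i, a]            probability that line i matches candidate a
    compat[i, a, j, b] compatibility in [-1, 1] of (i -> a) with (j -> b)
    neighbors[i]       indices of the lines grouped with line i
    """
    q = np.zeros_like(p)
    for i in range(p.shape[0]):
        if not neighbors[i]:
            continue
        for j in neighbors[i]:
            q[i] += compat[i, :, j, :] @ p[j]
        q[i] /= len(neighbors[i])
    return q

def relaxation_step(p, support):
    """Classical update: reinforce supported labels, then renormalize rows."""
    updated = p * (1.0 + support)
    return updated / updated.sum(axis=1, keepdims=True)

# Two grouped lines, two candidates each; matches with the same candidate
# index are mutually compatible (+1), otherwise incompatible (-1).
compat = np.empty((2, 2, 2, 2))
for a in range(2):
    for b in range(2):
        compat[:, a, :, b] = 1.0 if a == b else -1.0
p = np.array([[0.6, 0.4], [0.5, 0.5]])
neighbors = [[1], [0]]
for _ in range(10):
    p = relaxation_step(p, neighbor_support(p, compat, neighbors))
```

In this toy case the slight initial preference of the first line propagates to its undecided neighbor, and both lines settle on the mutually consistent match.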


e.  Stereo  Image  Interpretation  in  the  Presence  of  Narrow  Occluding  Objects. 

In a study of the use of stereo vision to establish object correspondence, we
reviewed major developments in establishing stereo correspondence for the extraction of the 3-D
structure of a scene; identified broad categories of algorithms based upon differences in image
geometry, matching primitives, and computational structure; and reviewed the performance of
these stereo techniques on various test images [27].

In this study, we examined the twin issues of the gain in accuracy of stereo
correspondence and the accompanying increase in computational cost due to the use of a third camera for
stereo analysis [28]. Trinocular stereo algorithms differ from binocular algorithms in the
epipolar constraint used in the local matching stage. The current literature does not provide any
insight into the relative merits of binocular and trinocular stereo matching with the matching
accuracy being verified against the ground truth. We conducted experiments to evaluate the relative
performance of binocular and trinocular stereo algorithms using stereo images generated by
applying a Lambertian reflectance model to real digital elevation maps (DEMs) from the U.S.
Geological Survey. The matching accuracy of the stereo algorithms was evaluated by comparing
the observed stereo disparity against the ground truth derived from the DEMs. We observed that
trinocular local matching reduced the percentage of mismatches having large disparity errors by
more than half, as compared to binocular matching, but increased the computational cost by only
about one fourth. We also performed a quantization-error analysis of the depth reconstruction
process for the nonparallel stereo-imaging geometry used in the experiments.
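
To indicate what a quantization-error analysis measures, the following sketch treats the simpler parallel-axis stereo geometry, which is an assumption made here for brevity and is not the nonparallel geometry analyzed in [28]. With focal length f (in pixels), baseline B, and disparity d, depth is Z = fB/d, so every depth between fB/(d + 1/2) and fB/(d - 1/2) maps to the same integer disparity. The function names and numbers are hypothetical.

```python
def depth_from_disparity(focal_px, baseline, disparity_px):
    """Depth for parallel-axis stereo: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline / disparity_px

def depth_quantization_interval(focal_px, baseline, disparity_px, step_px=1.0):
    """Nearest and farthest depths that round to the same quantized disparity."""
    half = 0.5 * step_px
    if disparity_px - half <= 0:
        raise ValueError("disparity too small for this quantization step")
    near = focal_px * baseline / (disparity_px + half)
    far = focal_px * baseline / (disparity_px - half)
    return near, far

# Hypothetical camera: f = 500 pixels, B = 0.2 m.
z10 = depth_from_disparity(500.0, 0.2, 10.0)
near10, far10 = depth_quantization_interval(500.0, 0.2, 10.0)
near5, far5 = depth_quantization_interval(500.0, 0.2, 5.0)
```

The interval widens rapidly as disparity shrinks, which is why depth uncertainty grows roughly with the square of the distance in this geometry.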

We have developed an approach to stereo vision that utilizes the Dynamic Disparity
Search (DDS) framework, which combines the spatial hierarchy with a new disparity hierarchy
mechanism to reduce stereo matching errors caused by the presence of narrow occluding objects
[29-30]. Narrow, occluding objects in stereo images cause matching errors that cannot be
handled by the spatial hierarchy method alone. The merits of the DDS approach are demonstrated
on real stereo images.

Most contemporary stereo correspondence algorithms impose global consistency among
candidate match-points using only a spatial hierarchy mechanism. As mentioned above, the
spatial hierarchy mechanism cannot handle the presence of narrow occluding objects. We have
analyzed the stereo matching failures caused by the spatial hierarchy mechanism [31] and
formulated a new global matching framework. Experimental results show a significant decrease in the
false positive match-rate using this framework.

f.  Analysis  of  Video  Images  Using  Point  and  Line  Correspondences. 

We have investigated the problem of analyzing time-varying imagery using a
feature-based approach [32]. We assume a scenario in which the imaged objects remain
stationary while the camera moves. The goal is to compute the structure of the imaged objects and the
motion of the camera from a sequence of video images. Our method exploits the principle of the
invariance of rigid configurations during motion. Using the rigidity constraints, we specify
equations based on distance and angular invariance to compute the structural parameters of the
imaged 3-D objects independently of the camera's motion. Once the structural parameters are
recovered, the motion parameters can be computed. The advantage of this approach is in the
decomposition of the computations of structure and motion, and the simultaneous use of point and
line correspondences, so that the approach is not limited to objects whose images have
only a particular type of feature in abundance. Our study considers the special case of four
points and one line, which is the minimum feature set required for computing the structure and
motion parameters, as well as additional features in an overdetermined system of equations to
improve the reliability and accuracy of the computation. The algorithm's validity was
demonstrated using computer simulation results as well as results from real image sequences.

4. Autonomous Navigation.

a. Construction of an Autonomous Mobile Robot, Robo-Tex.

Mobile robots are finding an increasing number of applications in military and
civilian environments. Tele-operated robots, guided by remotely located human operators, now
perform many operations in hazardous environments such as high-radiation zones. Many more
potential applications exist in manufacturing, surveillance, and planetary exploration. However,
to be truly useful, robots will need to be more autonomous in perceiving, understanding, and
responding to the environment. Computer vision, the automated understanding of video images
and other sensor data, is central to the development of robots that can self-navigate and perform
useful tasks in indoor and outdoor environments.

Although it is fairly easy to link video cameras and computers, the automatic
understanding of digitized images is still a very difficult task. Traditionally, researchers have used several
cameras in a stereo-vision setup to perceive the depth of objects. Image features are extracted
from each image and matched with one another; depth is then computed by triangulation. The
resulting 3-dimensional representation of the robot's environment is used by path-planning
algorithms to navigate the robot and avoid obstacles in its path.

An innovative mobile robot, Robo-Tex, has been developed at the Computer and Vision
Research Center [33-34]. As pictured in Figure 6, Robo-Tex uses a single video camera and
perceives depth by tracking features over a sequence of images. This simplifies the robot hardware
and reduces the amount of computation needed for image processing.

One common problem in processing image data is the selection of significant, meaningful
features from the image and the elimination of less meaningful features. In an indoor,
man-made environment such as an office building, some prominent features that are useful for
navigation are the boundaries, or edges, of the walls and doorways that the robot must identify in
order to navigate successfully. In man-made environments, most of the significant edges (those
corresponding to walls and doorways) have particular 3-dimensional orientations, usually
vertical and horizontal. As described in Section 3 above, we have developed new perception
algorithms for Robo-Tex that concentrate on such edges, thereby eliminating many insignificant
features [21-24].

Robo-Tex relies on the geometrical properties of vanishing points to estimate the most
likely orientation of the edges in each 2-dimensional image. For example, in the image of a
3-dimensional scene, vertical edges appear to converge to one point, the vanishing point of vertical
lines. This approach reduces the number of unwanted features, increases the sensitivity to useful
features, and drastically speeds the computation.

While most mobile robots are connected to large computers by cable or radio links,
Robo-Tex carries an onboard HP-735 workstation. The workstation is easier to program than the
special-purpose boards usually used for robot control, yet it has the computing power required by
vision algorithms. This configuration allows new vision algorithms to be tested quickly and
easily. Electrical power for the workstation and other onboard equipment is provided by 12 V
batteries and a 110 V AC power inverter. Standard equipment can be added to the robot by simply
plugging it in.

The ultimate goal of Robo-Tex is to navigate autonomously in both indoor and outdoor
environments while building an accurate CAD model of the world. Already, significant progress
toward that goal has been made. The robot can visually measure distances between edges with a
precision comparable to that available from an architect's plan and automatically generate
complete models of buildings [32-33]. The techniques used in this robot vision system find
application in a number of areas, including architecture and graphics, robot navigation, active
vision, and scene understanding.

The Robo-Tex vision system uses a 3D representation of the environment that
concentrates on architecturally significant features, and which is more accurate than is strictly necessary
for navigation. In order to create an architectural CAD model of the environment, the vision
system must not represent a large number of insignificant details. In designing this system, the
objective was to design all stages of image interpretation, including the lowest image processing
levels, to provide higher stages with the most semantically useful features. The system includes a
line segment detector (described above [18-21]), an automatic tracker, and a CAD modeler
optimized for environments with prominent 3D orientations [30-31].

b.  Position  Estimation  Techniques  for  an  Autonomous  Mobile  Robot. 

The development of truly autonomous mobile robots is one of the most
challenging and important applications of computer vision. The basic tasks involved in robot navigation
are as follows: (1) sensing the environment; (2) mapping the environment, i.e., building a
representation of the environment; (3) locating itself with respect to the environment (position
estimation); and (4) planning and executing efficient routes in the environment (path planning and
obstacle avoidance). Several projects have been carried out under this contract to develop more
efficient and accurate methods of position estimation. Position estimation techniques vary,
depending upon the environment in which the robot must navigate (indoor or outdoor), the type of
robot sensors used (visual, range, etc.), and the information that is known about the environment
(map representation, coordinate systems, etc.). Position estimation methods can be broadly
classified into four categories: landmark-based methods; methods using trajectory integration and
dead reckoning; methods using a standard reference pattern; and methods using a priori
knowledge of a world model matched to sensor data for position estimation. The two projects
described below fall into the last category.

In methods that match sensor data to a world model, the model (or map) of the
environment may be a CAD description of the environment, a floor map, or, in outdoor terrain, a digital
elevation map (DEM). To estimate the robot's position in the environment, the robot's sensor
observations are matched to the given map. Once a correspondence is established between the
sensor data and the world model (map) data, the robot's position and pose are calculated as a
coordinate transformation that transforms the world model into the sensor coordinate system.
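
Once correspondences are available, recovering this coordinate transformation is a small least-squares problem. The sketch below is a generic 2-D illustration using the standard SVD-based rigid alignment, not the procedure of any project in this report: it assumes matched model and sensor points in a plane and returns a rotation R and translation t such that sensor is approximately R applied to model plus t. The sample points and pose are hypothetical.

```python
import numpy as np

def estimate_rigid_transform_2d(model_pts, sensor_pts):
    """Least-squares rotation R (2x2) and translation t with sensor ~ R @ model + t."""
    A = np.asarray(model_pts, dtype=float)
    B = np.asarray(sensor_pts, dtype=float)
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                 # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against a reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = cb - R @ ca
    return R, t

def sensor_origin_in_model(R, t):
    """Robot (sensor origin) position expressed in model coordinates."""
    return -R.T @ t

# Hypothetical map points and their coordinates as seen by the sensor.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([4.0, -2.0])
model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
sensor = model @ R_true.T + t_true
R_est, t_est = estimate_rigid_transform_2d(model, sensor)
robot_xy = sensor_origin_in_model(R_est, t_est)
```

The hard part in practice is establishing the correspondences; the two projects described below concentrate on constraining that search.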

Although  mobile  robots  are  equipped  with  wheel  encoders  that  can  estimate  the  robot’s 
position  at  every  instant,  these  estimates  contain  errors  due  to  wheel  slippage  and  quantization 
effects. As the robot moves, these errors accumulate and can grow without bound, causing the position
estimate to become increasingly uncertain. Therefore, most mobile robots use additional forms
of  sensing,  such  as  vision,  to  aid  the  position  estimation  process. 
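
The growth of dead-reckoning error is easy to reproduce numerically. The sketch below is an illustration under simple assumptions, not a model of any particular robot: odometry is integrated as a sequence of (distance, heading change) increments, and a small constant heading bias per step, a hypothetical stand-in for slippage and quantization effects, is added to every increment.

```python
import math

def integrate_odometry(steps, heading_bias=0.0):
    """Integrate (distance, heading_change) increments into a 2-D path."""
    x = y = theta = 0.0
    path = []
    for dist, dtheta in steps:
        theta += dtheta + heading_bias
        x += dist * math.cos(theta)
        y += dist * math.sin(theta)
        path.append((x, y))
    return path

def position_errors(true_path, estimated_path):
    """Euclidean distance between corresponding path points."""
    return [math.hypot(ex - tx, ey - ty)
            for (tx, ty), (ex, ey) in zip(true_path, estimated_path)]

# Drive 10 m straight ahead in 0.1 m steps; the estimate has a
# 0.001 rad heading bias per step.
steps = [(0.1, 0.0)] * 100
true_path = integrate_odometry(steps)
drift_path = integrate_odometry(steps, heading_bias=0.001)
errors = position_errors(true_path, drift_path)
```

The error keeps increasing with distance travelled because nothing in the integration can correct it, which is why an external reference such as vision is needed.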

1) Position Estimation of an Autonomous Mobile Robot In Cluttered Outdoor
Environments Using Geometric Visibility Constraints. In this project, we have considered the
position estimation of an autonomous mobile robot navigating in an outdoor urban, man-made
environment consisting of polyhedral buildings with flat rooftops [37-42]. Our world model is
made up of the 3-D descriptions of the line segments that compose the buildings' rooftops. Our
robot sensor is a video camera mounted on the robot. Establishing correspondence between the
image and the map is particularly difficult in this case, since they are of different dimensionality
(a 2-D image and a 3-D map), in different formats, and are described in different coordinate
frames. Nevertheless, optical imaging is still the preferred form of sensing since it is totally
passive.


Our procedure is to extract features from the image and then search the map to locate the
corresponding features. For a large map, an exhaustive search would require an enormous
amount of computation. To reduce the computational cost, we limit the search using visibility
constraints imposed by the model's geometry and the known camera geometry. We use a
two-stage constrained-search strategy. Stage One is a coarse search that narrows the robot location to
a small set of possible locations; Stage Two searches exhaustively through this set and accurately
establishes the robot's position in the environment.

The use of visibility constraints is a novel method that captures the geometric constraints
between the 3-D model and the 2-D image features by using a new, viewer-centered,
intermediate representation of the robot's environment, derived from the given world model, called edge
visibility regions (EVRs). The EVR inherently captures the geometric visibility constraints between the
world model and image features by partitioning the ground plane (the plane in which the robot
navigates) into a set of distinct, non-overlapping regions. Each region has an associated list of the
world model features visible in that region, known as the Visibility List (VL). Thus, each EVR is
a region of space with the topological property that from its points the same set of edges of the
model are visible through a complete circular scan. These geometric constraints are compiled
off-line, thereby reducing the runtime of the position estimation process.
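
How a precompiled visibility list prunes the search can be shown with a small lookup sketch. It covers only the run-time filtering idea and rests on strong simplifying assumptions that are not part of this work: the regions and their visibility lists are given as a plain dictionary, image edges have already been hypothesized as particular model edges, and a region is retained if at most a chosen number of observed edges are missing from its list. All identifiers are hypothetical.

```python
def candidate_regions(visibility_lists, observed_edges, max_spurious=0):
    """Regions whose visibility list explains the observed model edges.

    visibility_lists: dict mapping region id -> set of visible model edge ids
    observed_edges:   model edge ids hypothesized from the image
    max_spurious:     how many observed edges may be absent from a region's
                      list (tolerance for spurious detections)
    """
    observed = set(observed_edges)
    kept = []
    for region, visible in visibility_lists.items():
        unexplained = len(observed - set(visible))
        if unexplained <= max_spurious:
            kept.append(region)
    return sorted(kept)

# Hypothetical precompiled table for three regions of the ground plane.
VL = {
    "R1": {"e1", "e2"},
    "R2": {"e2", "e3"},
    "R3": {"e1", "e2", "e3"},
}
exact = candidate_regions(VL, ["e1", "e2"])
tolerant = candidate_regions(VL, ["e1", "e4"], max_spurious=1)
```

Edges that are visible from a region but not detected do not exclude that region here, which mirrors the need to tolerate missing features.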

Once the EVR description of the environment is formed off-line using the given world
model description, we use a modified Hough transform technique to perform transform
clustering and isolate the set of EVRs that are most likely to contain the robot location. In addition, we
can efficiently reduce the complexity of the search by propagating the geometric constraints
established by the EVRs to correctly identify the robot's position. These search techniques have
proven to be very robust to image feature detection errors such as missing features and spurious
features. The method is also robust to incomplete model descriptions and inaccurate feature
representation. Further, the EVR representation of the environment is advantageous in the mobile
robot's path-planning tasks.
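
The clustering step can be pictured as Hough-style voting. The sketch below is a generic accumulator, since the modified Hough transform used in this work is not detailed here: it assumes that each tentative pairing of image and model features yields one hypothesized robot position, quantizes the hypotheses into square cells of a hypothetical size, and returns the cell with the most votes.

```python
import math
from collections import Counter

def cluster_votes(position_hypotheses, cell_size=1.0):
    """Vote (x, y) hypotheses into grid cells and return the winning cell.

    Returns ((ix, iy), votes), where (ix, iy) indexes the cell
    [ix * cell_size, (ix + 1) * cell_size) x [iy * cell_size, (iy + 1) * cell_size).
    """
    votes = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in position_hypotheses
    )
    cell, count = votes.most_common(1)[0]
    return cell, count

# Hypothetical hypotheses: three consistent pairings and two outliers.
hypotheses = [(2.1, 3.2), (2.4, 3.9), (2.8, 3.5), (7.0, 1.0), (-4.2, 0.3)]
best_cell, best_votes = cluster_votes(hypotheses)
```

Hypotheses produced by wrong pairings scatter across the accumulator, while those from correct pairings pile up in one place, which is what makes voting tolerant of missing and spurious features.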

2) Position Estimation Of An Autonomous Mobile Robot In An Outdoor
Natural, Mountainous Environment. While the problems associated with navigating mobile
robots in an indoor structured environment are reasonably well studied and a number of different
approaches have been suggested, outdoor navigation of a mobile robot in an unstructured
environment is a more complex problem and many issues remain open and unsolved.

To consider the problem of estimating the position of an autonomous mobile robot
navigating in an outdoor mountainous environment, we assume that the robot is provided with a
visual camera that can be panned and tilted, as well as a compass and an altimeter to measure its
altitude [43-44]. A Digital Elevation Map (DEM) of the navigation area is provided. The DEM
is a 3-D database that records the terrain elevations for ground positions at regularly spaced
intervals. Our problem is to find common features to match the 2-D intensity images from
the robot camera to the 3-D DEM, and thereby estimate the robot's position.

Our approach to this problem is to extract features from the images and then search the
map for corresponding features. Once this correspondence is found, the position can then be
computed. As with the project described earlier, an exhaustive search of the entire map would be
prohibitively expensive, so we formulate this correspondence problem as a constrained search
problem. Since the robot is assumed to be located in the DEM, the DEM grid is used as a
quantized version of the entire space of possible robot locations. The feature used to search the
DEM is the shape and position of the horizon line contour (HLC) in the image plane. From the
current robot position, images are taken in the four geographic directions, N, S, E, and W. The
contour of the horizon line (HLC) is extracted from these images and coded. Using the height of
the contour line in the image plane and the known camera geometry as input parameters, the entire
DEM is searched for possible camera locations so that the points in the elevation map project onto
the image plane to form a contour of the shape and height we are searching for. Since searching the
DEM exhaustively for the exact shape of the horizon line is a very computationally intensive
process, we split the search into two stages. In stage 1, we search using the height of the horizon
line at the center of the image plane in all four images. Geometric constraints derived from
the camera geometry and the height of the HLC are used to prune large subspaces of the search
space; finally, the position is isolated to a small set of possible locations. In stage 2, these
locations are then considered as the candidate robot positions, and the actual image that would be
seen at these points is generated from the DEM using computer graphics rendering techniques.
The HLCs are also extracted from these rendered images and then compared with the original HLCs
to arrive at a measure of disparity for each candidate. The robot location corresponding to the
lowest disparity is then considered as the best estimate of the robot's position. As the results
show, the approach is quite effective and, in almost all cases, the position estimate is very close
to the actual position.
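The two-stage search can be sketched in a few lines under simplifying assumptions. In the sketch
below, the horizon is represented by its elevation angle rather than by an image-plane height,
stage 1 is a brute-force filter over all grid cells rather than the geometric pruning described
above, and stage 2 replaces graphics rendering with ray marching over the DEM. The function names,
the tolerance `tol`, the eye height, the cell size, and the synthetic terrain are all illustrative
assumptions and are not taken from the original system.

```python
import math

CARDINALS = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]  # N, E, S, W azimuths


def horizon_angle(dem, row, col, azimuth, eye_height=2.0, cell_size=30.0):
    """Elevation angle (radians) of the skyline seen from a DEM cell.

    Assumes row 0 is the northern edge, azimuth 0 is north, pi/2 is east.
    eye_height and cell_size (metres) are illustrative values.
    """
    rows, cols = len(dem), len(dem[0])
    eye = dem[row][col] + eye_height
    d_row, d_col = -math.cos(azimuth), math.sin(azimuth)
    best, step = -math.pi / 2, 1
    while True:
        r = round(row + step * d_row)  # nearest-cell sampling along the ray
        c = round(col + step * d_col)
        if not (0 <= r < rows and 0 <= c < cols):
            return best  # the ray has left the map
        best = max(best, math.atan2(dem[r][c] - eye, step * cell_size))
        step += 1


def profile(dem, row, col, azimuths):
    """Horizon angles from one cell along each of the given azimuths."""
    return [horizon_angle(dem, row, col, a) for a in azimuths]


def disparity(p, q):
    """Sum of squared differences between two horizon profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def locate(dem, seen_cardinal, seen_full, full_azimuths, tol=0.01):
    # Stage 1: keep cells whose four central horizon heights match the observation.
    candidates = [
        (r, c)
        for r in range(len(dem))
        for c in range(len(dem[0]))
        if all(abs(a - b) <= tol
               for a, b in zip(profile(dem, r, c, CARDINALS), seen_cardinal))
    ]
    # Stage 2: compare the full contour only for the surviving candidates.
    best = min(
        candidates,
        key=lambda rc: disparity(profile(dem, rc[0], rc[1], full_azimuths), seen_full),
        default=None,
    )
    return best, candidates


# Synthetic terrain and a simulated observation from a known cell.
dem = [[500 + 40 * math.sin(0.9 * r) + 30 * math.cos(0.6 * c) + 3 * r + 0.5 * r * c
        for c in range(12)] for r in range(10)]
azimuths = [2 * math.pi * k / 36 for k in range(36)]
truth = (4, 7)
best, candidates = locate(dem, profile(dem, *truth, CARDINALS),
                          profile(dem, *truth, azimuths), azimuths)
print(best, len(candidates))
```

The point of the split is visible in `locate`: the expensive full-contour comparison runs only on
the cells that survive the cheap four-value test.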

c.  Calibrating  a  Mobile  Camera's  Parameters. 

Our  research  has  addressed  the  problem  of  the  calibration  of  the  relative  rotation 
and  translation  between  a  camera  and  a  mobile  robot's  coordinate  system,  as  well  as  the  camera's 
intrinsic  parameters,  from  a  sequence  of  monocular  images  and  robot  movements  [45-46]. 
Existing  hand/eye  calibration  procedures  for  robot  arms  are  not  directly  applicable  because  they 
require the robot hand to have at least two rotational degrees of freedom. A suitable
representation for camera rotation is used, and the calibration task is decomposed into two stages.
Furthermore, to recover the camera's rotational motion, the inverse perspective geometry
constraints of a rectangular corner are employed. Complicated calibration patterns are thus not
needed.  The  calibration  procedures  were  tested  using  both  synthetic  and  real  data. 

d.  Calibration  Procedure  for  a  Fish-Eye  Lens  (Super-Wide-Angle  Lens)  Camera.

Calibration of cameras is an important issue in computer vision. Accurate camera
calibration is crucial in applications that involve quantitative measurements, such as 3-D
sensing and measurement for robotic vision. Because they provide an extremely large field of view,
fish-eye lenses are useful in robotic vision for allowing close objects to be viewed in their
entirety and for perceiving objects that appear from an unpredictable direction. We have developed
a new algorithm for the geometric camera calibration of a fish-eye lens mounted on a CCD TV
camera [47]. The algorithm determines a mapping between points in the world coordinate
system and their corresponding point locations in the image plane. The parameters to be calibrated
are the effective focal length, one-pixel width on the image plane, image distortion center, and
distortion coefficients. A simple calibration pattern consisting of equally spaced dots is
introduced as a reference for calibration. Some parameters to be calibrated are eliminated by
setting up the calibration pattern precisely and assuming negligible distortion at the image
distortion center. Thus, the number of unknown parameters to be calibrated is drastically reduced,
enabling simple and useful calibration. The method employs a polynomial transformation between
points in the world coordinate system and their corresponding image plane locations. The
coefficients of the polynomial are determined using Lagrangian estimation. The effectiveness of the
proposed calibration method is confirmed by experimentation.
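As a rough illustration of fitting a polynomial distortion model to dot-pattern measurements, the
sketch below fits an odd radial polynomial by ordinary least squares. Least squares stands in here
for the Lagrangian estimation named above, whose details are not given in this report, and the
purely radial two-term form, the normalized radii, and the names `fit_radial_polynomial` and
`undistort_radius` are assumptions made for the example. The absence of a constant term reflects
the assumption of negligible distortion at the distortion center.

```python
def fit_radial_polynomial(r_image, r_ideal):
    """Least-squares fit of r_ideal ~ k1 * r + k3 * r**3 (no constant term).

    r_image: measured radial distances of the dots from the distortion center.
    r_ideal: radial distances the same dots would have without distortion.
    """
    s11 = sum(r ** 2 for r in r_image)
    s13 = sum(r ** 4 for r in r_image)
    s33 = sum(r ** 6 for r in r_image)
    b1 = sum(r * u for r, u in zip(r_image, r_ideal))
    b3 = sum(r ** 3 * u for r, u in zip(r_image, r_ideal))
    det = s11 * s33 - s13 ** 2
    if det == 0:
        raise ValueError("need at least two distinct non-zero radii")
    # Solve the 2x2 normal equations by Cramer's rule.
    k1 = (b1 * s33 - b3 * s13) / det
    k3 = (s11 * b3 - s13 * b1) / det
    return k1, k3


def undistort_radius(r, k1, k3):
    """Map a measured radius to its estimated undistorted radius."""
    return k1 * r + k3 * r ** 3


# Synthetic dot radii (normalized units) generated from known coefficients.
true_k1, true_k3 = 1.0, 0.35
measured = [0.1 * i for i in range(1, 11)]
ideal = [true_k1 * r + true_k3 * r ** 3 for r in measured]
k1, k3 = fit_radial_polynomial(measured, ideal)
print(k1, k3)
```

With noise-free synthetic data the fit should recover the generating coefficients to within
rounding error; with real measurements the residuals give a simple check on whether two terms
are enough.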
CONCLUSIONS.


The work outlined above has demonstrated amply that the fusion of multiple sensing
modalities has much to contribute to machine vision. We have continued to make significant
advances in the development of algorithms and techniques for fusing laser radar and thermal
images, and have addressed one of the most difficult areas of computer vision, namely, object
recognition in cluttered scenes. Significant strides have also been made in the development of
techniques for autonomous navigation. This work has been enthusiastically received by the
computer vision research community. Our results have been presented at reviewed conferences
such as the IEEE International Conference on Robotics and Automation, the IEEE International
Workshop on Intelligent Robotics and Systems, and the IEEE Conference on Computer Vision
and Pattern Recognition, and published in refereed journals including Pattern Recognition, IEEE
Transactions on Pattern Analysis and Machine Intelligence, Machine Vision and Applications,
Computer Vision, Graphics and Image Processing: Image Understanding, and IEEE
Transactions on Robotics and Automation. Book chapters based on this research have appeared
in the following books: Multisensor Fusion for Computer Vision; The Handbook of Pattern
Recognition and Computer Vision; Encyclopedia of Artificial Intelligence; Control and Dynamic
Systems: Advances in Theory and Applications, Volume 39: Advances in Robotic Systems;
Parallel Processing for Artificial Intelligence; and Autonomous Mobile Robots: Perception,
Mapping, and Navigation.

Three  papers  based  on  research  under  this  contract  have  received  distinguished  awards, 
including  "Multi-Sensor  Image  Interpretation  Using  Laser  Radar  and  Thermal  Images,"  (IEEE 
Computer  Society  Outstanding  Paper  Award,  7th  Conference  on  Artificial  Intelligence 
Applications  (1991)  [1]),  "Extraction  and  Interpretation  of  Semantically  Significant  Line 
Segments  for  a  Mobile  Robot,"  (Phillips  Award  for  Best  Paper  at  the  IEEE  Computer  Society 
International  Conference  on  Robotics  and  Automation,  Nice,  France  (1992)  [21]),  and  "Applying 
Perceptual  Organization  to  the  Detection  of  Man-made  Objects  in  Non-Urban  Scenes,"
(Honorable  Mention  of  the  Pattern  Recognition  Society  Award  for  Outstanding  Contribution, 
November  1993,  [8]).  A  total  of  13  graduate  and  4  undergraduate  students  were  supported  under 
this  contract,  and  6  Ph.D.  and  1  M.S.  degrees  were  completed  during  the  contract  term. 

Although we have made significant progress in the fusion of multiple sensing modalities
for machine vision, the development of truly general-purpose machine vision systems that are
capable of true autonomy in sensing, understanding, and responding to their environment
remains a distant objective. To further progress toward that goal, we must examine the techniques
employed by human vision, which is significantly better than machine vision at detecting and
recognizing objects because human vision integrates information from a number of sources and
apparently uses a number of different "algorithms." Current approaches to computer/machine
vision use a number of different techniques, including statistical pattern recognition,
model-based vision, neural networks, knowledge-based artificial intelligence (rule-based systems),
and adaptive and learning systems. Despite much investigation, machine vision techniques,
including image segmentation and multisensor fusion, have had only limited success in identifying
three-dimensional objects in cluttered environments. One possible means of improving
performance is to integrate multiple vision techniques into one system. For example, model-based
vision techniques are effective in situations where the models of both the objects to be recognized
and the environment are well understood. On the other hand, an approach based on artificial
neural networks (ANNs) is effective when model-based knowledge is unavailable or the number
of variables is so large as to make a model-based strategy unworkable. An intelligent hybrid
vision system that combines these two techniques could be successful in situations in which
systems based only on model-based vision or only on ANNs would fail. Further, not only can humans
integrate information despite irrelevant and incomplete information, they can integrate spatial and
temporal information, as well as integrate contextual information with model-based knowledge.
This suggests that the integration of temporal and spatial information could further assist such a
hybrid vision system. The superiority of human vision appears to lie in its use of multiple
sources of information and processing techniques, leading us to believe that a vision system
using a similar strategy to combine model-based vision with ANNs on spatial/temporal images
should be investigated.

Vision systems capable of recognizing 3D objects in a natural environment often
encounter significant problems, including noise, occlusion, low contrast, low resolution, and
background clutter. Given the knowledge we have today, this problem may not be immediately
solvable. However, through further exploration of combining existing techniques, substantial
progress toward that goal can be made.